Data Visualization with R

MSDA - Bootcamp 2025 Summer

Author

KT Wong

Published

August 21, 2025

Abstract

The materials in this topic are drawn from Imai and Williams (2022), Wickham and Grolemund (2023), Wickham (2019) and Wickham (2016) as well as other sources, including Princeton Sociology Methods Camp 2023. The materials are for educational purposes only.

ggplot2

ggplot2 is built on the grammar of graphics (Wickham 2016), which breaks a plot into the following components:

  • data
  • aesthetics
  • geoms
  • facets
  • stats
  • scales
  • coordinates
  • themes
  • Aesthetics
    • specify how we want our data to map onto our plot
      • Which variable belongs on the x-axis? What about the y-axis?
      • Are we going to convey additional dimensions of data with colour, or shape, or opacity?
  • Scale
    • A scale controls how data values are translated into visual values, such as positions along an axis or colours
    • Most of the time we’ll use a linear scale
    • but we can also use other options, such as a geometric or logarithmic scale, if the data are distributed differently and would better suit these transformations
  • Geoms
    • Geoms are the actual visual elements that we use to represent our data
      • Points, lines, bars, etc.
      • Geoms are the building blocks of our plot
  • Statistics
    • Statistics summarise the data before it is drawn
      • e.g. a histogram counts the observations in each bin, and a smoother fits a model to the points
  • Facets
    • Facets allow us to create multiple plots that each display a subset of the data
    • This is useful for comparing different groups or categories within the data

ggplot2

  • Every ggplot2 plot has three key components:
    • data
    • A set of aesthetic mappings between variables in the data and visual properties
    • At least one layer which describes how to render each observation
      • Layers are usually created with a geom function

ggplot2 - data illustration

  • Use built-in dataset from ggplot2: mpg
    • information about the fuel economy of popular car models in 1999 and 2008
    • collected by the US Environmental Protection Agency
    • here are some of the variables in the dataset:
    Variable       Description
    manufacturer   Car manufacturer
    model          Car model
    year           Year of manufacture
    displ          Engine displacement (litres)
    hwy            Miles per gallon (highway)
    cty            Miles per gallon (city)
    cyl            Number of cylinders
    drv            Drive type (f = front, r = rear, 4 = 4wd)
    class          Type of car
    trans          Type of transmission
    fl             Fuel type
    
  • The mpg dataset is a tibble, a modern version of a data frame

  • The mpg dataset is part of the ggplot2 package

  • The mpg dataset is a tidy dataset

  • This dataset suggests many interesting questions

    • How are engine size and fuel economy related?
    • Do certain manufacturers care more about fuel economy than others?
    • Has fuel economy improved in the last ten years?
  • List five functions that you could use to get more information about the mpg dataset

  • How can you find out what other datasets are included with ggplot2?

  • Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?

  • Which manufacturer has the most models in this dataset?

    • Which model has the most variations?
    • Does your answer change if you remove the redundant specification of drive train (e.g. “pathfinder 4wd”, “a4 quattro”) from the model name?
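
For the unit-conversion question above, one possible approach is sketched below. It assumes US gallons (1 gallon = 3.785411784 litres) and 1 mile = 1.609344 km, so that l/100km is roughly 235.2 divided by miles per gallon. The helper name mpg_to_l100km and the new column names are hypothetical, not part of ggplot2.

```r
library(ggplot2)  # provides the mpg dataset

# Hypothetical helper: convert miles per US gallon to litres per 100 km.
mpg_to_l100km <- function(x) {
  litres_per_gallon <- 3.785411784  # US liquid gallon
  km_per_mile <- 1.609344
  100 * litres_per_gallon / (km_per_mile * x)
}

# Add the converted columns next to the original ones.
mpg_metric <- transform(mpg,
                        cty_l100km = mpg_to_l100km(cty),
                        hwy_l100km = mpg_to_l100km(hwy))

head(mpg_metric[, c("model", "cty", "cty_l100km", "hwy", "hwy_l100km")])
```

Note that the relationship is inverted: a higher miles-per-gallon value corresponds to a lower l/100km value.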

ggplot2

  • Let us plot the relationship between engine size and fuel economy

Code
library(ggplot2)

data(mpg)

ggplot(data=mpg, 
       mapping=aes(x=displ, y=hwy)) +
  geom_point()

  • How would you describe the relationship between displ and hwy?
Code
ggplot(mpg, 
       aes(cty, hwy)) +
  geom_point()

Code
ggplot(diamonds, 
       aes(carat, price)) +
  geom_point()

Code
ggplot(economics, 
       aes(date, unemploy)) +
  geom_line()

Code
ggplot(mpg, 
       aes(cty)) +
  geom_histogram()

ggplot2

Colour, size, shape and other aesthetic attributes

  • Aesthetics are visual properties of the objects in the plot
    • colour, size, shape, linetype, fill, alpha
  • Aesthetics can be mapped to variables in the data
    • aes(colour=variable)
    • aes(size=variable)
    • aes(shape=variable)
    • aes(linetype=variable)
    • aes(fill=variable)
    • aes(alpha=variable)

ggplot2

Colour, size, shape and other aesthetic attributes

Code
ggplot(mpg, 
       aes(displ, hwy, 
           colour=class)) +
  geom_point()

Code
ggplot(mpg, 
       aes(trans, hwy,
           colour=class)) +
  geom_point()

  • ggplot2 takes care of the details of converting data (e.g., ‘f’, ‘r’, ‘4’) into aesthetics (e.g., ‘red’, ‘yellow’, ‘green’) with a scale

    • There is one scale for each aesthetic mapping in a plot.
    • The scale is also responsible for creating a guide, an axis or legend, that allows you to read the plot, converting aesthetic values back into data values
  • To choose the aesthetic values yourself, use the manual scale functions:

    • scale_colour_manual()
    • scale_size_manual()
    • scale_shape_manual()
    • scale_linetype_manual()
    • scale_fill_manual()
    • scale_alpha_manual()
  • Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values?

  • What about categorical values?

  • What happens when you use more than one aesthetic in a plot?
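
As a starting point for these questions, the sketch below contrasts a categorical and a continuous colour mapping on the mpg data. The colour names and legend labels passed to scale_colour_manual() are arbitrary choices for illustration.

```r
library(ggplot2)

# Categorical variable: discrete colour scale, legend with one key per level.
p_discrete <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(
    values = c("4" = "firebrick", "f" = "steelblue", "r" = "darkgreen"),
    labels = c("4" = "4wd", "f" = "front", "r" = "rear")
  )

# Continuous variable: colour gradient, legend drawn as a colour bar.
p_continuous <- ggplot(mpg, aes(displ, hwy, colour = cty)) +
  geom_point()

p_discrete
p_continuous
```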

ggplot2 — labels

  • Labels are important for making your plot understandable
    • xlab() and ylab() functions
    • labs() function

Code
ggplot(mpg, 
       aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  labs(x="Engine size (litres)",
       y="Highway fuel economy (miles per gallon)",
       title="Relationship between engine size and fuel economy",
       color="Car type",
       caption="Source: mpg dataset")+
  theme_bw()

ggplot2

ggthemes

Code
library(ggthemes)

ggplot(mpg, 
       aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  labs(x="Engine size (litres)",
       y="Highway fuel economy (miles per gallon)",
       title="Relationship between engine size and fuel economy",
       color="Car type",
       caption="Source: mpg dataset")+
  theme_economist()+
  scale_color_tableau() +
  theme(
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

ggplot2 — Facets

  • Facets allow you to create multiple plots that each display a subset of the data
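    • facet_wrap() wraps a sequence of panels, usually one per level of a single variable, into rows and columns
    • facet_grid() lays out panels in a matrix defined by a row variable and a column variable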

Code
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)
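
For comparison with facet_wrap(), here is a minimal facet_grid() sketch. The choice of drv for rows and cyl for columns is only an illustration; any two categorical variables would work.

```r
library(ggplot2)

# Rows are split by drive type (drv), columns by number of cylinders (cyl).
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_grid
```

Combinations with no observations still get a panel, which is left empty.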

ggplot2

Plot geoms

  • Geoms are the geometric objects that represent the data in the plot
    • geom_point() creates a scatterplot
    • geom_smooth() fits a smoother to the data and draws it with its standard error
    • geom_histogram() creates a histogram
    • geom_boxplot() creates a boxplot
    • geom_bar() creates a bar plot
    • geom_line() creates a line plot
    • geom_vline() adds a vertical line to the plot
    • geom_hline() adds a horizontal line to the plot
    • geom_abline() adds a diagonal line to the plot
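
The three reference-line geoms can be combined on one plot, as in the sketch below. Using the sample means for the horizontal and vertical lines is an arbitrary choice for illustration.

```r
library(ggplot2)

p_ref <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 0.4) +
  # Line with slope 1 through the origin: points above it have hwy > cty.
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # Horizontal and vertical lines at the sample means.
  geom_hline(yintercept = mean(mpg$hwy), colour = "grey40") +
  geom_vline(xintercept = mean(mpg$cty), colour = "grey40")

p_ref
```

geom_abline() is defined by a slope and an intercept, so it can draw any straight line, not only a 45-degree diagonal.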

ggplot2

Adding a smoother to a plot

Code
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span=0.3)
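
In the call above, span controls how wiggly the default loess smoother is, from 0 (very wiggly) to 1 (very smooth). As a contrast, the sketch below fits a straight line instead; the choice of a linear model here is an illustration, not a recommendation for these data.

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```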

ggplot2

Boxplots

Code
ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()+
  labs(title="Highway fuel economy by car type",
       x="Car type",
       y="Highway fuel economy (miles per gallon)")+
  coord_flip()+
  theme_economist()

ggplot2

Bar plots

  • Bar plots are useful for visualizing the distribution of a categorical variable
Code
ggplot(mpg, aes(class)) +
  geom_bar()

Code
ggplot(mpg, aes(class, fill=drv)) +
  geom_bar()
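
By default, geom_bar() stacks the bars when fill is mapped to a second variable. The sketch below shows two other position adjustments that may make comparisons easier; which one is preferable depends on the question being asked.

```r
library(ggplot2)

# Side-by-side bars: compare counts of each drive type within a class.
p_dodge <- ggplot(mpg, aes(class, fill = drv)) +
  geom_bar(position = "dodge")

# Bars rescaled to a height of 1: compare proportions across classes.
p_fill <- ggplot(mpg, aes(class, fill = drv)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

p_dodge
p_fill
```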

ggplot2

Histograms and density plots

  • Histograms and density plots are useful for visualizing the distribution of a continuous variable
Code
ggplot(mpg, aes(hwy)) +
  geom_histogram() 

Code
ggplot(mpg, aes(hwy)) +
  geom_density()
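
Without further arguments, geom_histogram() uses 30 bins and prints a message suggesting that you pick a binwidth. The sketch below sets the binwidth explicitly; the value of 2 is an arbitrary choice and is worth varying.

```r
library(ggplot2)

# Each bar covers 2 miles per gallon.
p_hist <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

# Frequency polygons use lines, which makes overlaying groups easier.
p_poly <- ggplot(mpg, aes(hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)

p_hist
p_poly
```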

ggplot2

Histograms and density plots

Code
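library(ggpubr)  # provides ggarrange()

# Density curves of engine size, one per drive type
den<- ggplot(mpg, aes(displ, colour = drv)) + 
  geom_density(linewidth=0.8)
  
# Histograms of engine size, one panel per drive type
hist<- ggplot(mpg, aes(displ, fill = drv)) + 
  geom_histogram(binwidth = 0.5) + 
  facet_wrap(~drv, ncol = 1)

# Place the two plots side by side
ggarrange(den, hist, ncol=2)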

ggplot2

ggsave - save the graph as an image file

Code
ggsave(filename="mpg_displ.png",width=6, height=4)
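
Called as above, ggsave() saves the last plot that was displayed. A slightly more explicit variant is sketched below: it passes the plot object by name. The object name p_save and the temporary output path are hypothetical.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Hypothetical output path; tempdir() keeps the working directory clean.
out_file <- file.path(tempdir(), "mpg_displ.png")

# width and height are in inches by default; dpi sets the resolution.
ggsave(filename = out_file, plot = p_save, width = 6, height = 4, dpi = 300)
```

The file type is inferred from the extension, so changing ".png" to ".pdf" would produce a PDF instead.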

Final Example - toy imports to the US from 1996-2005

  • This example is drawn from Scott (2021)
Code
library(tidyverse)

toy_imports <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/toyimports.csv")

head(toy_imports)
# A tibble: 6 × 8
  partner  year partner_name       product product_name US_report_import pop2000
  <chr>   <dbl> <chr>                <dbl> <chr>                   <dbl>   <dbl>
1 ARE      1998 United Arab Emira…  950341 "Toys repre…             1.06  3.25e6
2 ARE      2000 United Arab Emira…  950349 "Toys repre…            12.0   3.25e6
3 ARE      2003 United Arab Emira…  950349 "Toys repre…             4.65  3.25e6
4 ARE      2005 United Arab Emira…  950320 "Reduced-si…            49.2   3.25e6
5 ARG      1996 Argentina           950341 "Toys repre…             0     3.69e7
6 ARG      1996 Argentina           950310 "Electric t…            10.8   3.69e7
# ℹ 1 more variable: region <dbl>

  • Task: make a graph showing total toy imports over time for the U.S.’s top 5 trading partners by total dollar value of toys imported

Final Example - toy imports to the US from 1996-2005

Code
country_total<- toy_imports %>% 
  group_by(partner_name) %>%
  summarize(total_import=sum(US_report_import)) %>%
  arrange(desc(total_import)) %>%
  head(5)

country_total
# A tibble: 5 × 2
  partner_name     total_import
  <chr>                   <dbl>
1 China               26842305.
2 Denmark              1034990.
3 Canada                572309.
4 Hong Kong, China      545186.
5 Switzerland           400969.
  • Each row records the total dollar value of toys imported to the U.S. (US_report_import, in multiples of $1,000) in a specific product category from a specific country in a specific year

  • The product categories have unique numerical codes (product) as well as product names exciting enough to quicken the heart of any toy-loving child (“Parts and accessories :– Other,” “Toys representing animal or non-human figures,” and so on)

  • Group all the observations by trading partner (the partner_name variable)

  • For each partner, calculate total dollar value by summing toy imports (US_report_import) across all categories and years

  • Arrange the partners by total dollar value

Final Example - toy imports to the US from 1996-2005

Code
#| out-width: 100%

top5_partners=c("China", "Denmark", "Canada", "Hong Kong, China", "Switzerland")

options(scipen = 999)

library(ggthemes)
library(scales)
library(plotly)

p <- toy_imports %>% 
  filter(partner_name %in% top5_partners) %>%
  group_by(year, partner_name) %>%
  # .groups="drop" returns an ungrouped tibble and silences the grouping message
  summarize(total_import=sum(US_report_import), .groups="drop") %>% 
  ggplot(aes(year, total_import, color=partner_name)) +
  geom_line()+
  labs(title="Toy imports from the U.S.'s top-5 partners, 1996-2005",
       x="Year",
       y="Value of imports in $1,000s (log scale)",
       color="Trading partner")+
  scale_x_continuous(breaks=1996:2005)+
  theme_economist()+ 
  # label the log10 axis with powers of ten
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x)))

ggplotly(p)

The five coldest months in Rapid City from 1995 to 2011

Code
library(tidyverse)

rapidcity <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/rapidcity.csv")

rapidcity %>% 
  group_by(Year, Month) %>%
  summarize(avg_Temp = mean(Temp),
            lowest_temp = min(Temp),
            highest_temp = max(Temp)) %>%
  arrange(avg_Temp) %>%
  head(5) %>% 
  round(1)
# A tibble: 5 × 5
# Groups:   Year [4]
   Year Month avg_Temp lowest_temp highest_temp
  <dbl> <dbl>    <dbl>       <dbl>        <dbl>
1  1996     1     14.9       -11           46.1
2  2009    12     16.4        -2.6         35.6
3  2000    12     17.3        -9           38.8
4  1996    12     17.5       -10.8         40.4
5  2001     2     17.6        -3.9         40.8
  • Import the data set (we’ve done this already).
  • Split the data set into individual months in individual years: January 1995, February 1995, March 1995, and so on, all the way through December 2011.
  • For each individual month, calculate the average of the Temp variable (along with any other summaries we might find interesting).
  • Sort the individual months according to their average temperatures.
  • Make a table of the five coldest months

Survival on the Titanic

Q: how did survival among adult passengers vary by sex and cabin class?

Code
titanic <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/titanic.csv")

head(titanic)
# A tibble: 6 × 5
  name                            survived sex       age passengerClass
  <chr>                           <chr>    <chr>   <dbl> <chr>         
1 Allen, Miss. Elisabeth Walton   yes      female 29     1st           
2 Allison, Master. Hudson Trevor  yes      male    0.917 1st           
3 Allison, Miss. Helen Loraine    no       female  2     1st           
4 Allison, Mr. Hudson Joshua Crei no       male   30     1st           
5 Allison, Mrs. Hudson J C (Bessi no       female 25     1st           
6 Anderson, Mr. Harry             yes      male   48     1st           
Code
surv_adults<- titanic %>% 
  mutate(Adult = age >= 18) %>%
  filter(Adult) %>%
  group_by(sex, passengerClass) %>%
  summarize(total_count=n(),
            survived = sum(survived=="yes"),
            survival_rate = survived/total_count)


surv_adults
# A tibble: 6 × 5
# Groups:   sex [2]
  sex    passengerClass total_count survived survival_rate
  <chr>  <chr>                <int>    <int>         <dbl>
1 female 1st                    125      121        0.968 
2 female 2nd                     85       74        0.871 
3 female 3rd                    106       47        0.443 
4 male   1st                    144       47        0.326 
5 male   2nd                    143       12        0.0839
6 male   3rd                    289       45        0.156 
Code
library(ggthemes)

ggplot(surv_adults) +
  geom_col(aes(x=sex, y=survival_rate)) +
  facet_wrap(~passengerClass, nrow=1)+
  labs(title="Survival rate by sex and passenger class",
       y="Survival rate",
       x="Sex")+
  theme_economist()

  • create a new variable, which we’ll call Adult, that determines whether a passenger is at least 18 years old.
  • filter the data set down to adults only.
  • group the filtered data set by sex and cabin class (2 sexes × 3 classes = 6 groups).
  • calculate the survival rate for each group.
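
Since the rates are proportions, the y axis can also be shown in percent. The sketch below rebuilds the summary table from the counts printed in the surv_adults output so that it runs on its own; label_percent() comes from the scales package, and fixing the axis to 0-100% is an optional choice.

```r
library(ggplot2)
library(scales)

# Counts copied from the printed surv_adults table.
surv_adults <- data.frame(
  sex = rep(c("female", "male"), each = 3),
  passengerClass = rep(c("1st", "2nd", "3rd"), times = 2),
  total_count = c(125, 85, 106, 144, 143, 289),
  survived = c(121, 74, 47, 47, 12, 45)
)
surv_adults$survival_rate <- surv_adults$survived / surv_adults$total_count

p_surv <- ggplot(surv_adults, aes(sex, survival_rate)) +
  geom_col() +
  facet_wrap(~passengerClass, nrow = 1) +
  # Show the y axis as percentages and fix its range to 0-100%.
  scale_y_continuous(labels = label_percent(), limits = c(0, 1)) +
  labs(x = "Sex", y = "Survival rate")

p_surv
```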

Extra: Gapminder data

Code
library(gapminder)

data(gapminder)

gapminder %>% 
  group_by(year, continent) %>%
  # summarize() keeps one row per year and continent; mutate() would keep every country row
  summarize(median_lifeExp = median(lifeExp), .groups="drop") %>%
  ggplot(aes(year, median_lifeExp, color=continent)) +
  geom_line()+
  labs(title="Life expectancy by continent and year",
       x="Year",
       y="Life expectancy")+
  theme_economist()

Code
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1 / 4)

Extra: Gapminder data

This version of the plot applies the BBC style

Code
# install.packages('devtools')
# devtools::install_github('bbc/bbplot')

library(ggpubr)

source("https://raw.githubusercontent.com/kwan-MSDA/R/main/bbc_style.R")

gapminder %>% 
  group_by(year, continent) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  ggplot(aes(year, median_lifeExp, color=continent)) +
  geom_line()+
  labs(title="Life expectancy by continent and year",
       x="Year",
       y="Life expectancy")+
  bbc_style()

Extra: Gapminder data

Code
library("ggalt")
library("tidyr")
 
library(gapminder)

dumbbell_df <- gapminder %>%
  filter(year == 1967 | year == 2007) %>%
  select(country, year, lifeExp) %>%
  spread(year, lifeExp) %>%
  mutate(gap = `2007` - `1967`) %>%
  arrange(desc(gap)) %>%
  head(10)
 
#Make plot
ggplot(dumbbell_df, aes(x = `1967`, xend = `2007`, y = reorder(country, gap), group = country)) + 
  geom_dumbbell(colour = "#dddddd",
                size = 3,
                colour_x = "#FAAB18",
                colour_xend = "#1380A1") +
  bbc_style() + 
  labs(title="We're living longer",
       subtitle="Biggest life expectancy rise, 1967-2007")
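
The data preparation above uses spread(), which tidyr now marks as superseded. A sketch of the same reshaping step with pivot_wider() follows; the object name wide_new is hypothetical, and the rest of the pipeline is unchanged.

```r
library(dplyr)
library(tidyr)
library(gapminder)

wide_new <- gapminder %>%
  filter(year %in% c(1967, 2007)) %>%
  select(country, year, lifeExp) %>%
  # pivot_wider() is the newer replacement for spread():
  # one column per year, filled with life expectancy.
  pivot_wider(names_from = year, values_from = lifeExp) %>%
  mutate(gap = `2007` - `1967`) %>%
  arrange(desc(gap)) %>%
  head(10)

wide_new
```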

Extra: Gapminder data

Code
library(hrbrthemes)
library(viridis)

gapminder %>% 
  filter(year==2007) %>%
  mutate(country=factor(country, levels=unique(country))) %>%
  arrange(desc(pop)) %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, fill=continent)) +
  geom_point(alpha=0.6, shape=21, color="black")+
  scale_size(range=c(.1, 24), name="Population (M)")+
  scale_fill_viridis(discrete=TRUE, guide="none", option="A")+  # guide=FALSE is deprecated
  theme_ipsum()+
  theme(legend.position="none")+
  labs(title="Life expectancy by continent in 2007",
       x="GDP per capita",
       y="Life Expectancy")

Extra: Gapminder data

Code
library(gganimate)

gapminder %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, fill=continent, frame=year)) +
  geom_point(alpha=0.6, shape=21, color="black")+
  scale_size(range=c(.1, 22), name="Population (M)")+
  scale_fill_viridis(discrete=TRUE, guide="none", option="A")+  # guide=FALSE is deprecated
  theme_ipsum()+
  theme(legend.position="none")+
  labs(title="Life expectancy by continent in {frame_time}",
       x="GDP per capita",
       y="Life Expectancy")+
  geom_text(data=gapminder %>%  filter(pop >1e+8), aes(label=country), size=5, nudge_x=0.1, nudge_y=0.1)+
  transition_time(year)+
  enter_fade()+
  exit_fade()

Code
anim_save("gapminder_gganimate.gif")

Extra: Gapminder data

Code
library(plotly)
library(hrbrthemes)
library(viridis)

g<- crosstalk::SharedData$new(gapminder %>% 
                              mutate(country=factor(country, levels=unique(country))) %>%
                              arrange(desc(pop)),
                              ~ continent)
gg<- g %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, fill=continent, frame=year)) +
  geom_point(aes(size=pop, alpha=0.6, ids=country))+
  scale_size(range=c(.1, 24), name="Population (M)")+
  # guide="none" replaces the deprecated guide=FALSE
  scale_fill_viridis(discrete=TRUE, guide="none", option="A")+
  scale_alpha(range=c(0.6, 1), guide="none")+
  theme_ipsum()+
  # theme(legend.position="none")+
  labs(title="Life expectancy by continent between 1952-2007",
       x="GDP per capita",
       y="Life Expectancy")

ggplotly(gg, height = 500, width = 800)

References

Imai, Kosuke, and Nora Webb Williams. 2022. Quantitative Social Science: An Introduction in Tidyverse. Princeton, New Jersey: Princeton University Press.
Scott, James. 2021. “Data Science in R: A Gentle Introduction.” 2021. https://bookdown.org/jgscott/DSGI/.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Second edition. Use R! Switzerland: Springer.
———. 2019. Advanced R. Second edition. The R Series. Boca Raton, FL: CRC Press.
Wickham, Hadley, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Second edition. Beijing: O’Reilly.